Results 1 - 20 of 54
1.
EBioMedicine; 102: 105075, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38565004

ABSTRACT

BACKGROUND: AI models have shown promise in performing many medical imaging tasks. However, our ability to explain what signals these models have learned is severely lacking. Explanations are needed to increase the trust of doctors in AI-based models, especially in domains where AI prediction capabilities surpass those of humans. Moreover, such explanations could enable novel scientific discovery by uncovering signals in the data that are not yet known to experts. METHODS: In this paper, we present a workflow for generating hypotheses about which visual signals in images are correlated with a classification model's predictions for a given task. The approach leverages an automatic visual explanation algorithm followed by interdisciplinary expert review, and comprises four steps: (i) Train a classifier to perform the given task, to assess whether the imagery indeed contains task-relevant signals; (ii) Train a StyleGAN-based image generator with an architecture that enables guidance by the classifier ("StylEx"); (iii) Automatically detect, extract, and visualize the top visual attributes that the classifier is sensitive to. For visualization, we independently modify each of these attributes to generate counterfactual visualizations for a set of images (i.e., what the image would look like with the attribute increased or decreased); (iv) Formulate hypotheses for the underlying mechanisms, to stimulate future research. Specifically, we present the discovered attributes and corresponding counterfactual visualizations to an interdisciplinary panel of experts so that hypotheses can account for social and structural determinants of health (e.g., whether the attributes correspond to known pathophysiological or socio-cultural phenomena, or could be novel discoveries). FINDINGS: To demonstrate the broad applicability of our approach, we present results on eight prediction tasks across three medical imaging modalities: retinal fundus photographs, external eye photographs, and chest radiographs. We showcase examples where many of the automatically learned attributes clearly capture clinically known features (e.g., types of cataract, enlarged heart), and demonstrate automatically learned confounders that arise from factors beyond physiological mechanisms (e.g., chest X-ray underexposure is correlated with the classifier predicting abnormality, and eye makeup is correlated with the classifier predicting low hemoglobin levels). We further show that our method reveals a number of physiologically plausible attributes that are previously unknown in the literature (e.g., differences in the fundus associated with self-reported sex). INTERPRETATION: Our approach enables hypothesis generation via attribute visualizations and has the potential to help researchers better understand, improve their assessment of, and extract new knowledge from AI-based models, as well as debug models and design better datasets. Although not designed to infer causality, the attributes generated by our framework can capture phenomena beyond physiology or pathophysiology, reflecting the real-world nature of healthcare delivery and socio-cultural factors, so interdisciplinary perspectives are critical in these investigations. Finally, we will release code to help researchers train their own StylEx models and analyze their predictive tasks of interest, and to use the methodology presented in this paper for responsible interpretation of the revealed attributes. FUNDING: Google.
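A minimal sketch of step (iii), the counterfactual attribute probe, using a toy generator and classifier in place of a trained StylEx model (all names, shapes, and the perturbation rule below are illustrative assumptions, not the released code):

import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for a trained StylEx generator and classifier
# (hypothetical; a real StylEx uses a StyleGAN generator guided
# by the classifier during training).
W_GEN = rng.normal(size=(16, 64))   # style coords -> "image" pixels
W_CLF = rng.normal(size=64)         # classifier weights on pixels

def generate(style):                 # render an image from style coords
    return np.tanh(style @ W_GEN)

def classify(image):                 # classifier probability for the task
    return 1.0 / (1.0 + np.exp(-image @ W_CLF))

def top_attributes(styles, delta=1.0, k=3):
    """Rank style coordinates by how much perturbing each one
    (holding the rest fixed) moves the classifier's output."""
    n_coords = styles.shape[1]
    effects = np.zeros(n_coords)
    for j in range(n_coords):
        up, down = styles.copy(), styles.copy()
        up[:, j] += delta            # counterfactual: attribute increased
        down[:, j] -= delta          # counterfactual: attribute decreased
        effects[j] = np.mean(np.abs(classify(generate(up))
                                    - classify(generate(down))))
    return np.argsort(effects)[::-1][:k], effects

styles = rng.normal(size=(100, 16))  # style codes for a set of images
top, effects = top_attributes(styles)
print("most classifier-sensitive style coordinates:", top)

In the actual workflow, the images rendered from the shifted style codes are what the expert panel reviews; the ranking step above only selects which attributes to visualize.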


Subjects
Algorithms, Cataract, Humans, Cardiomegaly, Fundus Oculi, Artificial Intelligence
2.
Radiol Artif Intell; e230079, 2024 Mar 13.
Article in English | MEDLINE | ID: mdl-38477661

ABSTRACT

"Just Accepted" papers have undergone full peer review and have been accepted for publication in Radiology: Artificial Intelligence. This article will undergo copyediting, layout, and proof review before it is published in its final version. Please note that during production of the final copyedited article, errors may be discovered which could affect the content. Purpose To evaluate the impact of an artificial intelligence (AI) assistant for lung cancer screening (LCS) on multinational clinical workflows. Materials and Methods An AI assistant for LCS was evaluated on two retrospective randomized multireader multicase studies, where 627 (141 cancer positive) low-dose chest CT cases were each read twice (with and without AI assistance) by experienced thoracic radiologists (6 US-based or 6 Japan-based), resulting in a total of 7,524 interpretations. Positive cases were defined as those within two years before a pathology-confirmed lung cancer diagnosis. Negative cases were defined as those without any subsequent cancer diagnosis for at least two years and were enriched for a spectrum of diverse nodules. The studies measured the readers' level of suspicion (LoS, on a 0-100 scale), country-specific screening system scoring categories, and management recommendations. Evaluation metrics included the area under the receiver operating characteristic curve (AUC) for LoS and sensitivity and specificity of recall recommendations. Results With AI assistance, the radiologists' AUC increased by 0.023 (0.70 to 0.72, P = .02) for the US study and by 0.023 (0.93 to 0.96, P = .18) for the Japan study. Scoring system specificity for actionable findings increased 5.5% (57%-63%, P < .001) for the US study and 6.7% (23%-30%, P < .001) for the Japan study. There was no evidence of a difference in corresponding sensitivity between unassisted and AI-assisted reads for the US (67.3%-67.5%, P = .88) and Japan (98%-100%, P > .99) studies. Corresponding standalone AI AUC system performance was 0.75 95% CI [0.70-0.81] and 0.88 95%CI [0.78-0.97] for the US and Japan-based datasets, respectively. Conclusion The concurrent AI interface improved LCS specificity in both US and Japan-based reader studies, meriting further study in additional international screening environments. ©RSNA, 2024.

3.
JAMA; 331(3): 242-244, 2024 Jan 16.
Article in English | MEDLINE | ID: mdl-38227029

ABSTRACT

Importance: Interest in artificial intelligence (AI) has reached an all-time high, and health care leaders across the ecosystem are faced with questions about where, when, and how to deploy AI and how to understand its risks, problems, and possibilities. Observations: While AI as a concept has existed since the 1950s, not all AI is the same. The capabilities and risks of various kinds of AI differ markedly, and on examination 3 epochs of AI emerge. AI 1.0 includes symbolic AI, which attempts to encode human knowledge into computational rules, as well as probabilistic models. The era of AI 2.0 began with deep learning, in which models learn from examples labeled with ground truth. This era brought about many advances both in people's daily lives and in health care. Deep learning models are task-specific, meaning they do one thing at a time, and they primarily focus on classification and prediction. AI 3.0 is the era of foundation models and generative AI. Models in AI 3.0 have fundamentally new (and potentially transformative) capabilities, as well as new kinds of risks, such as hallucinations. These models can do many different kinds of tasks without being retrained on a new dataset. For example, a simple text instruction will change the model's behavior. Prompts such as "Write this note for a specialist consultant" and "Write this note for the patient's mother" will produce markedly different content. Conclusions and Relevance: Foundation models and generative AI represent a major revolution in AI's capabilities, offering tremendous potential to improve care. Health care leaders are making decisions about AI today. While any heuristic omits details and loses nuance, the framework of AI 1.0, 2.0, and 3.0 may be helpful to decision-makers because each epoch has fundamentally different capabilities and risks.


Subjects
Artificial Intelligence, Delivery of Health Care, Humans, Artificial Intelligence/classification, Artificial Intelligence/history, Decision Making, Delivery of Health Care/history, History, 20th Century, History, 21st Century
4.
Lancet Digit Health; 6(2): e126-e130, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38278614

ABSTRACT

Advances in machine learning for health care have prompted concerns from the research community about bias, specifically the introduction, perpetuation, or exacerbation of care disparities. Reinforcing these concerns is the finding that medical images often reveal signals about sensitive attributes in ways that are hard to pinpoint by both algorithms and people. This finding raises the question of how best to design general purpose pretrained embeddings (GPPEs, defined as embeddings meant to support a broad array of use cases) for building downstream models that are free from particular types of bias. The downstream model should be carefully evaluated for bias, and audited and improved as appropriate. However, in our view, well intentioned attempts to prevent the upstream components (GPPEs) from learning sensitive attributes can have unintended consequences for the downstream models. Despite producing a veneer of technical neutrality, the resultant end-to-end system might still be biased or poorly performing. Building on previously published data, we present reasons to support the view that GPPEs should ideally contain as much information as the original data contain, and we highlight the perils of trying to remove sensitive attributes from a GPPE. We also emphasise that downstream prediction models trained for specific tasks and settings, whether developed using GPPEs or not, should be carefully designed and evaluated to avoid bias that makes models vulnerable to issues such as distributional shift. These evaluations should be done by a diverse team, including social scientists, on a diverse cohort representing the full breadth of the patient population for which the final model is intended.


Subjects
Delivery of Health Care, Machine Learning, Humans, Bias, Algorithms
5.
Transl Vis Sci Technol; 12(12): 11, 2023 Dec 1.
Article in English | MEDLINE | ID: mdl-38079169

ABSTRACT

Purpose: Real-world evaluation of a deep learning model that prioritizes patients based on their risk of progression to moderate or worse (MOD+) diabetic retinopathy (DR). Methods: This nonrandomized, single-arm, prospective, interventional study included patients attending DR screening at four centers across Thailand from September 2019 to January 2020 who had mild or no DR. Fundus photographs were input into the model, and patients were scheduled for their subsequent screening from September 2020 to January 2021 in order of predicted risk. Evaluation focused on model sensitivity, defined as correctly ranking patients who developed MOD+ within the first 50% of subsequent screens. Results: We analyzed 1,757 patients, of whom 52 (3.0%) developed MOD+. Using the model-proposed order, the model's sensitivity was 90.4%. Both the model-proposed order and a ranking by DR grade (mild vs. no DR) plus HbA1c had significantly higher sensitivity than a random order (P < 0.001). Excluding one major (rural) site that had practical implementation challenges, the remaining sites included 567 patients, of whom 15 (2.6%) developed MOD+; here, the model-proposed order achieved a sensitivity of 86.7%, versus 73.3% for the ranking that used DR grade and HbA1c. Conclusions: The model can help prioritize follow-up visits for the largest subgroups of DR patients (those with no or mild DR). Further research is needed to evaluate the impact on clinical management and outcomes. Translational Relevance: Deep learning demonstrated potential for risk stratification in DR screening. However, real-world practicalities must be resolved to fully realize the benefit.
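The headline sensitivity metric, the fraction of MOD+ progressors ranked within the first half of the proposed screening order, reduces to a few lines. A sketch with hypothetical risk scores and labels:

import numpy as np

def ranking_sensitivity(risk_scores, progressed, frac=0.5):
    """Fraction of progressors (MOD+) ranked within the first
    `frac` of screens when patients are seen in descending
    predicted-risk order."""
    order = np.argsort(-np.asarray(risk_scores))       # highest risk first
    n_first = int(len(order) * frac)
    first_half = set(order[:n_first].tolist())
    progressors = [i for i, p in enumerate(progressed) if p]
    hits = sum(i in first_half for i in progressors)
    return hits / len(progressors)

# Hypothetical example: 10 patients, 2 of whom progressed to MOD+.
scores = [0.9, 0.1, 0.4, 0.8, 0.2, 0.3, 0.7, 0.05, 0.6, 0.15]
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(ranking_sensitivity(scores, labels))  # 1.0: both in first 5 screens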


Subjects
Deep Learning, Diabetes Mellitus, Diabetic Retinopathy, Humans, Diabetic Retinopathy/diagnosis, Diabetic Retinopathy/epidemiology, Prospective Studies, Glycated Hemoglobin, Risk Assessment
7.
Nature; 620(7972): 172-180, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37438534

ABSTRACT

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate the Pathways Language Model (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA, MedMCQA, PubMedQA and Measuring Massive Multitask Language Understanding (MMLU) clinical topics), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.


Subjects
Benchmarking, Computer Simulation, Knowledge, Medicine, Natural Language Processing, Bias, Clinical Competence, Comprehension, Datasets as Topic, Licensure, Medicine/methods, Medicine/standards, Patient Safety, Physicians
8.
Nat Biomed Eng; 7(6): 756-779, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37291435

ABSTRACT

Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates this 'out-of-distribution' performance problem and improves model robustness and training efficiency. The strategy, which we named REMEDIS (for 'Robust and Efficient Medical Imaging with Self-supervision'), combines large-scale supervised transfer learning on natural images with intermediate contrastive self-supervised learning on medical images, and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies by up to 11.5% relative to strong supervised baseline models, and in out-of-distribution settings required only 1-33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.
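The intermediate self-supervised stage of REMEDIS is contrastive, in the SimCLR family. A minimal NumPy sketch of an NT-Xent-style contrastive loss over two augmented views (the objective only, not the paper's full pipeline; shapes and temperature are illustrative assumptions):

import numpy as np

def nt_xent(z1, z2, tau=0.1):
    """SimCLR-style contrastive loss over two augmented views.
    z1, z2: (n, d) embeddings of the same n images under
    different augmentations; each row's positive is its twin."""
    z = np.concatenate([z1, z2])                      # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    n = len(z1)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logprob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -logprob[np.arange(2 * n), targets].mean()

rng = np.random.default_rng(0)
view1 = rng.normal(size=(8, 32))
view2 = view1 + 0.05 * rng.normal(size=(8, 32))       # correlated views
print(nt_xent(view1, view2))

Minimizing this pulls the two views of each medical image together in embedding space while pushing apart views of different images, which is the signal the intermediate pretraining stage exploits.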


Subjects
Machine Learning, Supervised Machine Learning, Diagnostic Imaging
10.
Commun Med (Lond); 3(1): 59, 2023 Apr 24.
Article in English | MEDLINE | ID: mdl-37095223

ABSTRACT

BACKGROUND: The presence of lymph node metastasis (LNM) influences prognosis and clinical decision-making in colorectal cancer. However, detection of LNM is variable and depends on a number of external factors. Deep learning has shown success in computational pathology but has struggled to boost performance when combined with known predictors. METHODS: Machine-learned features are created by clustering deep learning embeddings of small patches of tumor in colorectal cancer via k-means, and then selecting the top clusters that add predictive value to a logistic regression model when combined with known baseline clinicopathological variables. We then analyze the performance of logistic regression models trained with and without these machine-learned features in combination with the baseline variables. RESULTS: The machine-learned features provide independent signal for the presence of LNM (AUROC: 0.638, 95% CI: 0.590-0.683). Furthermore, the machine-learned features add predictive value to the set of 6 clinicopathologic variables in an external validation set (likelihood ratio test, p < 0.00032; AUROC: 0.740, 95% CI: 0.701-0.780). A model incorporating these features can also further risk-stratify patients with and without identified metastasis (p < 0.001 for both stage II and stage III). CONCLUSION: This work demonstrates an effective approach to combining deep learning with established clinicopathologic factors in order to identify independently informative features associated with LNM. Further work building on these specific results may have an important impact on prognostication and therapeutic decision-making for LNM. Additionally, this general computational approach may prove useful in other contexts.


When colorectal cancers spread to the lymph nodes, it can indicate a poorer prognosis. However, detecting lymph node metastasis (spread) can be difficult and depends on a number of factors such as how samples are taken and processed. Here, we show that machine learning, which involves computer software learning from patterns in data, can predict lymph node metastasis in patients with colorectal cancer from the microscopic appearance of their primary tumor and the clinical characteristics of the patients. We also show that the same approach can predict patient survival. With further work, our approach may help clinicians to inform patients about their prognosis and decide on appropriate treatments.
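A minimal sketch of the feature-construction and likelihood-ratio-test logic described above, with synthetic embeddings and labels standing in for real data (cluster count, dimensions, and the use of scikit-learn here are illustrative assumptions, not the authors' exact pipeline):

import numpy as np
from scipy.stats import chi2
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical inputs: per-patch deep embeddings grouped by case,
# baseline clinicopathologic variables, and LNM labels.
n_cases, patches_per_case, dim = 200, 50, 32
embeddings = rng.normal(size=(n_cases, patches_per_case, dim))
baseline = rng.normal(size=(n_cases, 6))          # 6 known variables
y = rng.integers(0, 2, size=n_cases)              # LNM present?

# Cluster all patches; each case becomes a histogram over clusters.
km = KMeans(n_clusters=8, n_init=10, random_state=0)
cluster_ids = km.fit_predict(embeddings.reshape(-1, dim))
cluster_ids = cluster_ids.reshape(n_cases, patches_per_case)
hist = np.stack([np.bincount(row, minlength=8) / patches_per_case
                 for row in cluster_ids])         # machine-learned features

def loglik(X, y):
    # Large C approximates an unregularized fit, so the LRT is valid.
    m = LogisticRegression(max_iter=1000, C=1e6).fit(X, y)
    p = m.predict_proba(X)[:, 1]
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Likelihood-ratio test: do the learned features add signal
# beyond the baseline variables alone?
ll0 = loglik(baseline, y)
ll1 = loglik(np.hstack([baseline, hist]), y)
lr_stat = 2 * (ll1 - ll0)
print("p =", chi2.sf(lr_stat, df=hist.shape[1]))

With random inputs the test is expectedly nonsignificant; on real data, a small p-value is what supports the claim that the learned features carry independent signal.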

11.
JAMA Netw Open; 6(3): e2254891, 2023 Mar 1.
Article in English | MEDLINE | ID: mdl-36917112

ABSTRACT

Importance: Identifying new prognostic features in colon cancer has the potential to refine histopathologic review and inform patient care. Although prognostic artificial intelligence systems have recently demonstrated significant risk stratification for several cancer types, studies have not yet shown that the machine learning-derived features associated with these prognostic artificial intelligence systems are both interpretable and usable by pathologists. Objective: To evaluate whether pathologist scoring of a histopathologic feature previously identified by machine learning is associated with survival among patients with colon cancer. Design, Setting, and Participants: This prognostic study used deidentified, archived colorectal cancer cases from January 2013 to December 2015 from the University of Milano-Bicocca. All available histologic slides from 258 consecutive colon adenocarcinoma cases were reviewed from December 2021 to February 2022 by 2 pathologists, who conducted semiquantitative scoring for tumor adipose feature (TAF), which was previously identified via a prognostic deep learning model developed with an independent colorectal cancer cohort. Main Outcomes and Measures: Prognostic value of TAF for overall survival and disease-specific survival as measured by univariable and multivariable regression analyses. Interpathologist agreement in TAF scoring was also evaluated. Results: A total of 258 colon adenocarcinoma histopathologic cases from 258 patients (138 men [53%]; median age, 67 years [IQR, 65-81 years]) with stage II (n = 119) or stage III (n = 139) cancer were included. Tumor adipose feature was identified in 120 cases (widespread in 63 cases, multifocal in 31, and unifocal in 26). For overall survival analysis after adjustment for tumor stage, TAF was independently prognostic in 2 ways: TAF as a binary feature (presence vs absence: hazard ratio [HR] for presence of TAF, 1.55 [95% CI, 1.07-2.25]; P = .02) and TAF as a semiquantitative categorical feature (HR for widespread TAF, 1.87 [95% CI, 1.23-2.85]; P = .004). Interpathologist agreement for widespread TAF vs lower categories (absent, unifocal, or multifocal) was 90%, corresponding to a κ metric at this threshold of 0.69 (95% CI, 0.58-0.80). Conclusions and Relevance: In this prognostic study, pathologists were able to learn and reproducibly score for TAF, providing significant risk stratification on this independent data set. Although additional work is warranted to understand the biological significance of this feature and to establish broadly reproducible TAF scoring, this work represents the first validation to date of human expert learning from machine learning in pathology. Specifically, this validation demonstrates that a computationally identified histologic feature can represent a human-identifiable, prognostic feature with the potential for integration into pathology practice.
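Both agreement figures reported above (percent agreement and κ at the widespread-vs-lower threshold) can be reproduced with standard tooling. A sketch with made-up scores from two readers:

from sklearn.metrics import cohen_kappa_score

# Hypothetical per-case TAF scores from two pathologists:
# 0 = absent, 1 = unifocal, 2 = multifocal, 3 = widespread.
path_a = [3, 0, 2, 3, 1, 0, 0, 3, 2, 1]
path_b = [3, 0, 1, 3, 1, 0, 1, 2, 2, 1]

# Binarize at the threshold used in the study: widespread vs. lower.
a = [int(s == 3) for s in path_a]
b = [int(s == 3) for s in path_b]

agreement = sum(x == y for x, y in zip(a, b)) / len(a)
print("percent agreement:", agreement)
print("kappa:", cohen_kappa_score(a, b))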


Subjects
Adenocarcinoma, Colonic Neoplasms, Male, Humans, Aged, Colonic Neoplasms/diagnosis, Pathologists, Artificial Intelligence, Machine Learning, Risk Assessment
12.
Lancet Digit Health; 5(5): e257-e264, 2023 May.
Article in English | MEDLINE | ID: mdl-36966118

ABSTRACT

BACKGROUND: Photographs of the external eye were recently shown to reveal signs of diabetic retinal disease and elevated glycated haemoglobin. This study aimed to test the hypothesis that external eye photographs contain information about additional systemic medical conditions. METHODS: We developed a deep learning system (DLS) that takes external eye photographs as input and predicts systemic parameters, such as those related to the liver (albumin, aspartate aminotransferase [AST]); kidney (estimated glomerular filtration rate [eGFR], urine albumin-to-creatinine ratio [ACR]); bone or mineral (calcium); thyroid (thyroid stimulating hormone); and blood (haemoglobin, white blood cells [WBC], platelets). This DLS was trained using 123 130 images from 38 398 patients with diabetes undergoing diabetic eye screening in 11 sites across Los Angeles county, CA, USA. Evaluation focused on nine prespecified systemic parameters and leveraged three validation sets (A, B, C) spanning 25 510 patients with and without diabetes undergoing eye screening in three independent sites in Los Angeles county, CA, and the greater Atlanta area, GA, USA. We compared performance against baseline models incorporating available clinicodemographic variables (eg, age, sex, race and ethnicity, years with diabetes). FINDINGS: Relative to the baseline, the DLS achieved statistically significant superior performance at detecting AST >36·0 U/L, calcium <8·6 mg/dL, eGFR <60·0 mL/min/1·73 m2, haemoglobin <11·0 g/dL, platelets <150·0 × 103/µL, ACR ≥300 mg/g, and WBC <4·0 × 103/µL on validation set A (a population resembling the development datasets), with the area under the receiver operating characteristic curve (AUC) of the DLS exceeding that of the baseline by 5·3-19·9% (absolute differences in AUC). On validation sets B and C, with substantial patient population differences compared with the development datasets, the DLS outperformed the baseline for ACR ≥300·0 mg/g and haemoglobin <11·0 g/dL by 7·3-13·2%. INTERPRETATION: We found further evidence that external eye photographs contain biomarkers spanning multiple organ systems. Such biomarkers could enable accessible and non-invasive screening of disease. Further work is needed to understand the translational implications. FUNDING: Google.


Subjects
Deep Learning, Diabetic Retinopathy, Humans, Retrospective Studies, Calcium, Diabetic Retinopathy/diagnosis, Biomarkers, Albumins
13.
Radiology; 306(1): 124-137, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36066366

ABSTRACT

Background: The World Health Organization (WHO) recommends chest radiography to facilitate tuberculosis (TB) screening. However, chest radiograph interpretation expertise remains limited in many regions. Purpose: To develop a deep learning system (DLS) to detect active pulmonary TB on chest radiographs and compare its performance to that of radiologists. Materials and Methods: A DLS was trained and tested using retrospective chest radiographs (acquired between 1996 and 2020) from 10 countries. To improve generalization, large-scale chest radiograph pretraining, attention pooling, and semisupervised learning ("noisy-student") were incorporated. The DLS was evaluated in a four-country test set (China, India, the United States, and Zambia) and in a mining population in South Africa, with positive TB confirmed with microbiological tests or nucleic acid amplification testing (NAAT). The performance of the DLS was compared with that of 14 radiologists, and its efficacy relative to nine of those radiologists was assessed using the Obuchowski-Rockette-Hillis procedure. Given WHO targets of 90% sensitivity and 70% specificity, the operating point of the DLS (0.45) was prespecified to favor sensitivity. Results: A total of 165,754 images from 22,284 subjects (mean age, 45 years; 21% female) were used for model development and testing. In the four-country test set (1,236 subjects, 17% with active TB), the receiver operating characteristic (ROC) curve of the DLS was higher than that of all nine India-based radiologists, with an area under the ROC curve of 0.89 (95% CI: 0.87, 0.91). Compared with these radiologists, at the prespecified operating point the DLS sensitivity was higher (88% vs 75%, P < .001) and its specificity was noninferior (79% vs 84%, P = .004). Trends were similar within other patient subgroups, in the South Africa dataset, and across various TB-specific chest radiograph findings. In simulations, using the DLS to identify likely TB-positive chest radiographs for NAAT confirmation reduced the cost by 40%-80% per TB-positive patient detected. Conclusion: A deep learning method was found to be noninferior to radiologists for the determination of active tuberculosis on digital chest radiographs. © RSNA, 2022.
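Prespecifying an operating point against the WHO targets amounts to scanning the ROC curve of a tuning set for thresholds that meet both targets, favoring sensitivity when they conflict. A sketch with synthetic scores (the selection rule below is an illustrative assumption, not the authors' exact procedure):

import numpy as np
from sklearn.metrics import roc_curve

def pick_operating_point(y_true, scores, min_sens=0.90, min_spec=0.70):
    """Return a threshold meeting the WHO targets on a tuning set,
    preferring sensitivity when no threshold meets both."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    ok = (tpr >= min_sens) & (1 - fpr >= min_spec)
    if not ok.any():
        ok = tpr >= min_sens              # favor sensitivity
    # Among candidates, take the one with the highest specificity.
    best = np.argmax(np.where(ok, 1 - fpr, -1))
    return thresholds[best]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
s = np.clip(y * 0.35 + rng.normal(0.35, 0.2, size=500), 0, 1)
print("chosen threshold:", pick_operating_point(y, s))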


Subjects
Deep Learning, Tuberculosis, Pulmonary, Humans, Female, Middle Aged, Male, Radiography, Thoracic/methods, Retrospective Studies, Radiography, Tuberculosis, Pulmonary/diagnostic imaging, Radiologists, Sensitivity and Specificity
14.
NPJ Breast Cancer; 8(1): 113, 2022 Oct 4.
Article in English | MEDLINE | ID: mdl-36192400

ABSTRACT

Histologic grading of breast cancer involves review and scoring of three well-established morphologic features: mitotic count, nuclear pleomorphism, and tubule formation. Taken together, these features form the basis of the Nottingham Grading System, which is used to inform breast cancer characterization and prognosis. In this study, we develop deep learning models to perform histologic scoring of all three components using digitized hematoxylin and eosin-stained slides containing invasive breast carcinoma. We first evaluate model performance using pathologist-based reference standards for each component. To complement this typical approach to evaluation, we further evaluate the deep learning models via prognostic analyses. The individual component models perform at or above published benchmarks for algorithm-based grading approaches, achieving high concordance rates with pathologist grading. Further, prognostic performance using deep learning-based grading is on par with that of pathologists performing review of matched slides. By providing scores for each component feature, the deep learning-based approach also provides the potential to identify the grading components contributing most to prognostic value. This may enable optimized prognostic models, opportunities to improve access to consistent grading, and approaches to better understand the links between histologic features and clinical outcomes in breast cancer.

15.
Surg Endosc; 36(12): 9215-9223, 2022 Dec.
Article in English | MEDLINE | ID: mdl-35941306

ABSTRACT

BACKGROUND: The potential role and benefits of AI in surgery have yet to be determined. This study is a first step in developing an AI system for minimizing adverse events and improving patient safety. We developed an artificial intelligence (AI) algorithm and evaluated its performance in recognizing surgical phases of laparoscopic cholecystectomy (LC) videos spanning a range of complexities. METHODS: A set of 371 LC videos with various complexity levels and containing adverse events was collected from five hospitals. Two expert surgeons segmented each video into 10 phases, including Calot's triangle dissection and clipping and cutting. For each video, adverse events (major bleeding, gallbladder perforation, major bile leakage, and incidental finding) were annotated when present, and the complexity level (on a scale of 1-5) was recorded. The dataset was then split in an 80:20 ratio (294 and 77 videos), stratified by complexity, hospital, and adverse events, to train and test the AI model, respectively. The AI-surgeon agreement was then compared to the agreement between surgeons. RESULTS: The mean accuracy of the AI model for surgical phase recognition was 89% [95% CI 87.1%, 90.6%], comparable to the mean inter-annotator agreement of 90% [95% CI 89.4%, 90.5%]. The model's accuracy was inversely associated with procedure complexity, decreasing from 92% (complexity level 1) to 88% (complexity level 3) to 81% (complexity level 5). CONCLUSION: The AI model successfully identified surgical phases in both simple and complex LC procedures. Further validation and system training are warranted to evaluate its potential applications, such as increasing patient safety during surgery.
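An 80:20 split stratified jointly on complexity, hospital, and adverse events can be approximated by stratifying on the combined label. A sketch with synthetic metadata (field names and counts are illustrative assumptions):

import itertools
from sklearn.model_selection import train_test_split

# Hypothetical metadata: hospital, complexity band, adverse-event flag,
# cycled to give every combination several members.
hospitals = ["H1", "H2", "H3", "H4", "H5"]
complexity = [1, 2, 3, 4, 5]
adverse = [False, True]
videos = [
    (f"v{i:03d}", h, c, a)
    for i, (h, c, a) in enumerate(
        itertools.islice(itertools.cycle(
            itertools.product(hospitals, complexity, adverse)), 320))
]

# One combined label so the split is stratified on all three factors.
strata = [f"{h}|{c}|{a}" for _, h, c, a in videos]
train, test = train_test_split(videos, test_size=0.2,
                               stratify=strata, random_state=0)
print(len(train), len(test))  # 256 64

Note that joint stratification requires every combined stratum to have at least a couple of members and a test set at least as large as the number of strata; with sparse combinations, coarser strata are the usual fallback.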


Subjects
Cholecystectomy, Laparoscopic, Gallbladder Diseases, Humans, Cholecystectomy, Laparoscopic/methods, Artificial Intelligence, Gallbladder Diseases/surgery, Dissection
16.
Commun Med (Lond); 2: 40, 2022.
Article in English | MEDLINE | ID: mdl-35603304

ABSTRACT

Background: Measuring vital signs plays a key role in both patient care and wellness, but can be challenging outside of medical settings due to the lack of specialized equipment. Methods: In this study, we prospectively evaluated smartphone camera-based techniques for measuring heart rate (HR) and respiratory rate (RR) for consumer wellness use. HR was measured by placing the finger over the rear-facing camera, while RR was measured via a video of the participant sitting still in front of the front-facing camera. Results: In the HR study of 95 participants (with a protocol that included measurements both at rest and post-exercise), the mean absolute percent error (MAPE) ± standard deviation of the measurement was 1.6% ± 4.3%, which was significantly lower than the pre-specified goal of 5%. No significant differences in the MAPE were present across colorimeter-measured skin-tone subgroups: 1.8% ± 4.5% for very light to intermediate, 1.3% ± 3.3% for tan and brown, and 1.8% ± 4.9% for dark. In the RR study of 50 participants, the mean absolute error (MAE) was 0.78 ± 0.61 breaths/min, which was significantly lower than the pre-specified goal of 3 breaths/min. The MAE was low in both healthy participants (0.70 ± 0.67 breaths/min) and participants with chronic respiratory conditions (0.80 ± 0.60 breaths/min). Conclusions: These results validate the accuracy of our smartphone camera-based techniques for measuring HR and RR across a range of pre-defined subgroups.
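The two error metrics used in these studies are straightforward to compute. A sketch with made-up paired readings:

import numpy as np

def mape(measured, reference):
    """Mean absolute percent error, as used for the HR study."""
    m, r = np.asarray(measured, float), np.asarray(reference, float)
    return 100.0 * np.mean(np.abs(m - r) / r)

def mae(measured, reference):
    """Mean absolute error, as used for the RR study."""
    m, r = np.asarray(measured, float), np.asarray(reference, float)
    return np.mean(np.abs(m - r))

# Hypothetical paired readings (device vs. reference).
hr_dev, hr_ref = [72, 88, 120, 65], [71, 90, 118, 66]
rr_dev, rr_ref = [14, 18, 22, 12], [15, 18, 21, 12]
print(f"HR MAPE: {mape(hr_dev, hr_ref):.1f}%")
print(f"RR MAE: {mae(rr_dev, rr_ref):.2f} breaths/min")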

17.
Nat Biomed Eng; 6(12): 1370-1383, 2022 Dec.
Article in English | MEDLINE | ID: mdl-35352000

ABSTRACT

Retinal fundus photographs can be used to detect a range of retinal conditions. Here we show that deep-learning models trained instead on external photographs of the eyes can be used to detect diabetic retinopathy (DR), diabetic macular oedema and poor blood glucose control. We developed the models using eye photographs from 145,832 patients with diabetes from 301 DR screening sites and evaluated the models on four tasks and four validation datasets with a total of 48,644 patients from 198 additional screening sites. For all four tasks, the predictive performance of the deep-learning models was significantly higher than the performance of logistic regression models using self-reported demographic and medical history data, and the predictions generalized to patients with dilated pupils, to patients from a different DR screening programme and to a general eye care programme that included diabetics and non-diabetics. We also explored the use of the deep-learning models for the detection of elevated lipid levels. The utility of external eye photographs for the diagnosis and management of diseases should be further validated with images from different cameras and patient populations.


Subjects
Deep Learning, Diabetic Retinopathy, Retinal Diseases, Humans, Sensitivity and Specificity, Diabetic Retinopathy/diagnostic imaging, Fundus Oculi
18.
Ophthalmol Retina; 6(5): 398-410, 2022 May.
Article in English | MEDLINE | ID: mdl-34999015

ABSTRACT

PURPOSE: To validate the generalizability of a deep learning system (DLS) that detects diabetic macular edema (DME) from 2-dimensional color fundus photographs (CFP), for which the reference standard for retinal thickness and fluid presence is derived from 3-dimensional OCT. DESIGN: Retrospective validation of a DLS across international datasets. PARTICIPANTS: Paired CFP and OCT of patients from diabetic retinopathy (DR) screening programs or retina clinics. The DLS was developed using datasets from Thailand, the United Kingdom, and the United States and validated using 3060 unique eyes from 1582 patients across screening populations in Australia, India, and Thailand. The DLS was separately validated in 698 eyes from 537 screened patients in the United Kingdom with mild DR and suspicion of DME based on CFP. METHODS: The DLS was trained using DME labels from OCT. The presence of DME was based on retinal thickening or intraretinal fluid. The DLS's performance was compared with expert grades of maculopathy and with a previous proof-of-concept version of the DLS. We further simulated the integration of the current DLS into an algorithm trained to detect DR from CFP. MAIN OUTCOME MEASURES: Superiority of specificity and noninferiority of sensitivity of the DLS for the detection of center-involving DME, using device-specific thresholds, compared with experts. RESULTS: The primary analysis in a combined dataset spanning Australia, India, and Thailand showed that the DLS had 80% specificity and 81% sensitivity, compared with expert graders, who had 59% specificity and 70% sensitivity. Relative to human experts, the DLS had significantly higher specificity (P = 0.008) and noninferior sensitivity (P < 0.001). In the dataset from the United Kingdom, the DLS had a specificity of 80% (P < 0.001 for specificity of >50%) and a sensitivity of 100% (P = 0.02 for sensitivity of >90%). CONCLUSIONS: The DLS can generalize to multiple international populations with an accuracy exceeding that of experts. The clinical value of this DLS to reduce false-positive referrals, and thus decrease the burden on specialist eye care, warrants prospective evaluation.


Subjects
Deep Learning, Diabetes Mellitus, Diabetic Retinopathy, Macular Edema, Diabetic Retinopathy/complications, Diabetic Retinopathy/diagnosis, Humans, Macular Edema/diagnosis, Macular Edema/etiology, Retrospective Studies, Tomography, Optical Coherence/methods, United States
19.
Nat Med; 28(1): 154-163, 2022 Jan.
Article in English | MEDLINE | ID: mdl-35027755

ABSTRACT

Artificial intelligence (AI) has shown promise for diagnosing prostate cancer in biopsies. However, results have been limited to individual studies, lacking validation in multinational settings. Competitions have been shown to be accelerators for medical imaging innovations, but their impact is hindered by lack of reproducibility and independent validation. With this in mind, we organized the PANDA challenge, the largest histopathology competition to date, joined by 1,290 developers, to catalyze the development of reproducible AI algorithms for Gleason grading using 10,616 digitized prostate biopsies. We validated that a diverse set of submitted algorithms reached pathologist-level performance on independent cross-continental cohorts, fully blinded to the algorithm developers. On United States and European external validation sets, the algorithms achieved agreements of 0.862 (quadratically weighted κ, 95% confidence interval (CI): 0.840-0.884) and 0.868 (95% CI: 0.835-0.900) with expert uropathologists. Successful generalization across different patient populations, laboratories, and reference standards, achieved by a variety of algorithmic approaches, warrants evaluating AI-based Gleason grading in prospective clinical trials.
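Quadratically weighted κ, the agreement measure used here, is available off the shelf. A sketch with hypothetical grade-group calls (0 = benign through 5 = grade group 5):

from sklearn.metrics import cohen_kappa_score

# Hypothetical grade-group assignments for 12 biopsies.
algorithm      = [0, 1, 1, 2, 3, 5, 4, 2, 0, 3, 5, 1]
uropathologist = [0, 1, 2, 2, 3, 5, 5, 2, 0, 2, 5, 1]

# Quadratic weights penalize large disagreements more than
# near-misses, matching the PANDA evaluation metric.
kappa = cohen_kappa_score(algorithm, uropathologist, weights="quadratic")
print(f"quadratically weighted kappa: {kappa:.3f}")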


Subjects
Neoplasm Grading, Prostatic Neoplasms/pathology, Algorithms, Biopsy, Cohort Studies, Humans, Male, Prostatic Neoplasms/diagnosis, Reproducibility of Results
20.
Med Image Anal; 75: 102274, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34731777

ABSTRACT

Supervised deep learning models have proven to be highly effective in the classification of dermatological conditions. These models rely on the availability of abundant labeled training examples. However, in the real world, many dermatological conditions are individually too infrequent for per-condition classification with supervised learning. Although individually infrequent, these conditions may collectively be common and are therefore clinically significant in aggregate. To prevent models from generating erroneous outputs on such examples, there remains a considerable unmet need for deep learning systems that can better detect such infrequent conditions, which are seen very rarely (or not at all) during training. In this paper, we frame this task as an out-of-distribution (OOD) detection problem. We set up a benchmark ensuring that outlier conditions are disjoint between the model training, validation, and test sets. Unlike traditional OOD detection benchmarks, where the task is to detect dataset distribution shift, we aim at the more challenging task of detecting subtle differences resulting from a different pathology or condition. We propose a novel hierarchical outlier detection (HOD) loss, which assigns multiple abstention classes corresponding to each training outlier class and jointly performs a coarse classification of inliers vs. outliers along with fine-grained classification of the individual classes. We demonstrate that the proposed HOD loss-based approach outperforms leading methods that leverage outlier data during training, and that performance is further boosted by using recent representation learning methods (BiT, SimCLR, MICLe). We also explore ensembling strategies for OOD detection and propose a diverse ensemble selection process for the best result. In addition, we perform a subgroup analysis over conditions of varying risk levels and different skin types to investigate how OOD performance changes over each subgroup, and demonstrate the gains of our framework in comparison to the baseline. Finally, we go beyond traditional performance metrics and introduce a cost matrix for model trust analysis to approximate downstream clinical impact; we use this cost matrix to compare the proposed method against the baseline, making a stronger case for its effectiveness in real-world scenarios.
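A NumPy sketch of the HOD loss structure described above: a fine-grained softmax over inlier classes plus per-outlier-class abstention classes, combined with a coarse inlier-vs-outlier term computed from summed probabilities (shapes and the coarse-term weight are illustrative assumptions, not the paper's exact settings):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def hod_loss(logits, fine_labels, n_inlier, coarse_weight=0.5):
    """Hierarchical outlier detection loss (sketch).
    logits: (n, n_inlier + n_outlier) scores; the trailing columns
    are per-outlier-class abstention classes.
    fine_labels: index of the true fine-grained class."""
    p = softmax(logits)
    n = len(logits)

    # Fine-grained term: ordinary cross-entropy over all classes.
    fine = -np.log(p[np.arange(n), fine_labels]).mean()

    # Coarse term: inlier probability is the sum over inlier classes.
    p_inlier = p[:, :n_inlier].sum(axis=1)
    is_inlier = (fine_labels < n_inlier).astype(float)
    coarse = -np.mean(is_inlier * np.log(p_inlier)
                      + (1 - is_inlier) * np.log(1 - p_inlier))
    return fine + coarse_weight * coarse

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 10))       # 7 inlier + 3 outlier classes
labels = rng.integers(0, 10, size=16)
print(hod_loss(logits, labels, n_inlier=7))

At test time, the summed outlier-class probability (1 - p_inlier) serves as the OOD score for flagging previously unseen conditions.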


Subjects
Dermatology, Benchmarking, Humans